Lindera
A morphological analysis library in Rust. This project is a fork of kuromoji-rs.
Lindera aims to be a library that is easy to install and provides concise APIs for various Rust applications.
The following is required to build Lindera:
- Rust >= 1.46.0
Usage
Put the following in Cargo.toml:
[dependencies]
lindera = { version = "0.30.0", features = ["ipadic"] }
Basic example
This example covers the basic usage of Lindera.
It will:
- Create a tokenizer in normal mode
- Tokenize the input text
- Output the tokens
use ;
The above example can be run as follows:
% cargo run --features=ipadic --example=tokenize_ipadic
You can see the result as follows:
日本語
の
形態素
解析
を
行う
こと
が
でき
ます
。
User dictionary example
You can supply user dictionary entries in addition to the default system dictionary. The user dictionary should be a CSV file in the following format:
<surface>,<part_of_speech>,<reading>
For example:
% cat ./resources/simple_userdic.csv
東京スカイツリー,カスタム名詞,トウキョウスカイツリー
東武スカイツリーライン,カスタム名詞,トウブスカイツリーライン
とうきょうスカイツリー駅,カスタム名詞,トウキョウスカイツリーエキ
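To make the three-column layout concrete, here is a minimal sketch that splits one such line into its fields using only the Rust standard library. It is not Lindera's own loader: the `UserEntry` type and the `parse_user_entry` function are hypothetical names introduced for illustration, and the sketch assumes that no field contains a comma or quoting.

```rust
/// One row of a simple user dictionary.
/// Hypothetical type for illustration; not part of Lindera's API.
#[derive(Debug, PartialEq)]
struct UserEntry {
    surface: String,
    part_of_speech: String,
    reading: String,
}

/// Parse one `<surface>,<part_of_speech>,<reading>` line.
/// Returns `None` unless the line has exactly three non-empty fields.
/// Assumption: fields never contain commas or quotes.
fn parse_user_entry(line: &str) -> Option<UserEntry> {
    let fields: Vec<&str> = line.trim().split(',').map(str::trim).collect();
    match fields.as_slice() {
        [surface, pos, reading] if fields.iter().all(|f| !f.is_empty()) => Some(UserEntry {
            surface: surface.to_string(),
            part_of_speech: pos.to_string(),
            reading: reading.to_string(),
        }),
        _ => None,
    }
}
```

A check like this can be useful for validating a dictionary file before handing it to the tokenizer, since a row with a missing column is easy to overlook in a long CSV.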
With a user dictionary, the Tokenizer is created as follows:
use PathBuf;
use ;
The above example can be run with cargo run --example:
% cargo run --features=ipadic --example=tokenize_ipadic_userdic
東京スカイツリー
の
最寄り駅
は
とうきょうスカイツリー駅
です
API reference
The API reference is available at the following URL:
- lindera-tokenizer